Add method fold_with to Producer and UnindexedProducer #184
Conversation
This allows parallel producers to customize the inner loop. This is a temporary measure until iterators gain this natively (one proposal is the fold_ok method).

This is marked WIP because I haven't been able to use this on UnindexedProducer parallel iterators to make a real difference. I do have success using ParallelIterator with UnindexedProducer if I do something heavy enough: for example, if the lower limit for can_split is 128 elements and the operation is expensive enough. Replacing that expensive function call with something that autovectorizes has big gains:

```rust
fn fastexp(x: f64) -> f64 {
    let x = 1. + x / 1024.;
    x.powi(1024)
}
```

However, this only helps if the can_split threshold is hardcoded to a larger limit (at least 10 000 elements, it turns out, for `fastexp`).
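For reference, this is roughly what such a hand-tuned threshold looks like on a producer; `SliceProducer` and `MIN_LEN` are made-up names for illustration, and rayon's actual producer traits differ:

```rust
// Illustration only: a hand-rolled producer over a mutable slice with a
// hardcoded splitting threshold. Not rayon's actual API.
struct SliceProducer<'a> {
    data: &'a mut [f64],
}

const MIN_LEN: usize = 10_000; // the "at least 10 000 elements" threshold

impl<'a> SliceProducer<'a> {
    fn can_split(&self) -> bool {
        // Only split while both halves would stay above the threshold.
        self.data.len() >= 2 * MIN_LEN
    }

    fn split(self) -> (SliceProducer<'a>, SliceProducer<'a>) {
        let mid = self.data.len() / 2;
        let (left, right) = self.data.split_at_mut(mid);
        (SliceProducer { data: left }, SliceProducer { data: right })
    }
}
```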
As an example of potential gains, the PR enables fastexp to be autovectorized in ndarray (with the parallel traits implemented, not yet integrated into ndarray), for a case like this:

```rust
let mut a = Array2::<f64>::zeros((M, N));
let mut a = a.slice_mut(s![.., ..-1]);
a.view_mut().into_par_iter().for_each(|x| *x = fastexp(*x));
```

The inner loop customization is needed because the array is not contiguous (due to the slicing), yet each row is contiguous, so the inner loop can still be subject to loop optimizations if we circumvent the general per-element iteration.
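In plain Rust terms, the shape of the win is something like the sketch below; `apply_per_row` is a made-up helper, not ndarray or rayon API:

```rust
// The sliced array as a whole is not contiguous, but each row is, so a tight
// loop over `&mut [f64]`-like rows can still be autovectorized.
fn apply_per_row(rows: &mut [Vec<f64>], f: impl Fn(f64) -> f64) {
    for row in rows.iter_mut() {
        // contiguous inner loop -- eligible for loop optimizations
        for x in row.iter_mut() {
            *x = f(*x);
        }
    }
}
```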
The WIP parallel iterator implementation is here: https://github.com/bluss/rust-ndarray/compare/par-iter...par-iter-elements?expand=1
Hi @bluss -- very cool! I will attempt to take a look soon. :)
```rust
            break;
        }
    }
    folder.complete()
```
I think it may be more useful to leave `complete` out of this. At least for `ChainProducer`, that would let it call `fold_using` separately on its two parts, recursively if there are multiple levels of chains, then just let the bridge functions complete it at the end.
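In plain Rust terms (no rayon types; `chain_fold` is just an illustrative helper), the idea is:

```rust
// A chained producer folds its first part, then keeps folding the second part
// with the same accumulator, and deliberately does NOT "complete" -- the
// caller (the bridge) finishes at the end.
fn chain_fold<T, Acc>(
    first: &[T],
    second: &[T],
    acc: Acc,
    f: impl Fn(Acc, &T) -> Acc,
) -> Acc {
    let acc = first.iter().fold(acc, &f);
    second.iter().fold(acc, &f)
}
```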
Neat. I know that idea :D rust-lang/rust#37315
I knew I had seen it somewhere! 😄
I might bikeshed this to something like `fold_with`. As for tuning parameters, note that we're trying to move away from manual tuning.
This shouldn't always happen unless your workload is very small. The "thief" splitter works by first splitting roughly to the number of job threads, and then each time a job is stolen, that job gets split to the number of threads again. Towards the end of a task there will probably be a lot of eager stealing and thus splitting, maybe down to the minimum, but for most of a busy workload the pieces should stay relatively large. Automatic tuning is hard -- if you have input on manual tuning parameters, please chime in!
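Roughly, that heuristic can be pictured like this (a simplified sketch, not rayon's actual code; names are illustrative):

```rust
// Simplified sketch of the adaptive "thief" splitting heuristic described
// above: start with a split budget of about one piece per thread, and refresh
// the budget whenever a piece is stolen.
#[derive(Clone, Copy)]
struct Splitter {
    splits: usize,
}

impl Splitter {
    fn new(num_threads: usize) -> Self {
        Splitter { splits: num_threads }
    }

    /// Called before processing a piece; `stolen` means another worker took it.
    fn try_split(&mut self, stolen: bool, num_threads: usize) -> bool {
        if stolen {
            // A steal is evidence of idle threads: allow splitting down to
            // roughly one chunk per thread again.
            self.splits = num_threads;
        }
        if self.splits > 0 {
            self.splits /= 2;
            true
        } else {
            false
        }
    }
}
```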
One important piece of tuning turns out to be (maybe not surprisingly) using an explicit configuration to use one thread per core (so not counting hyperthreading).
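For reference, with today's rayon that configuration would look roughly like this (the API at the time of this PR was different; the core count below is just an example value):

```rust
use rayon::ThreadPoolBuilder;

fn main() {
    // One thread per physical core; in practice this count could come from
    // e.g. num_cpus::get_physical() in the num_cpus crate.
    let physical_cores = 4;
    ThreadPoolBuilder::new()
        .num_threads(physical_cores)
        .build_global()
        .expect("global thread pool already initialized");
}
```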
I'm happy with this, but I'll leave it open for @nikomatsakis to take a look.
Sure. I'll squash this. I removed the WIP tag since it is an improvement when using a thread per core.
I'll squash when you think it's ready, I mean.
```rust
    where F: Folder<Self::Item>,
{
    for item in self {
        folder = folder.consume(item);
```
At some point, we discussed whether we should have `Producer: UnindexedProducer`, but we didn't have a good reason for it (even though it logically makes sense). The existence of this duplicate code might provide such a reason.
I'm new to the internals -- maybe `split_at` can be removed and only `split` is left?
`bridge_producer_consumer` (indexed case) is this today:

```rust
let mid = len / 2;
let (left_producer, right_producer) = producer.split_at(mid);
let (left_consumer, right_consumer, reducer) = consumer.split_at(mid);
```

It could change to something like:

```rust
let (left_producer, right_producer) = producer.split();
let mid = left_producer.len();
let (left_consumer, right_consumer, reducer) = consumer.split_at(mid);
```
So the difference is that an indexed producer can tell its length (and that it has a meaningful length or before/after ordering of the halves).
ndarray's unindexed producer can tell the length of each part perfectly. What sets it apart is that (1) it can't split exactly at len / 2, since it must split in chunks of some axis' size, and (2) it doesn't want to split into a "before" half and an "after" half; it prefers an unordered split for performance.
About order:
Take a 3 × 4 array:

```
[[1, 2, 3, 4],
 [5, 6, 7, 8],
 [9, 0, 1, 2]];
```

The logical order of the regular `.iter()` is defined to be "logical" row major, which means the elements in the order we read them: 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2. The memory order of the array is separated from the logical order of `.iter()`.

For parallelism, we guarantee no order, and the unindexed producer for an array view will pick either of these two splits, depending on which preserves memory locality:

```
[[1, 2, | 3, 4],
 - - - -|- - - -    <- v.split_at(Axis(0), 1)  // best for row major memory order
 [5, 6, | 7, 8],
 [9, 0, | 1, 2]];
        ^
        v.split_at(Axis(1), 2)                 // best for column major memory order
```
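In ndarray code, those two splits correspond to something like the following (using ndarray's `split_at(Axis, index)` on views; the assertions just show the resulting shapes):

```rust
use ndarray::{array, Axis};

fn main() {
    let a = array![[1, 2, 3, 4],
                   [5, 6, 7, 8],
                   [9, 0, 1, 2]];

    // Split after the first row: best when the data is row major.
    let (top, bottom) = a.view().split_at(Axis(0), 1);
    assert_eq!(top.shape(), &[1, 4]);
    assert_eq!(bottom.shape(), &[2, 4]);

    // Split after the second column: best when the data is column major.
    let (left, right) = a.view().split_at(Axis(1), 2);
    assert_eq!(left.shape(), &[3, 2]);
    assert_eq!(right.shape(), &[3, 2]);
}
```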
No, you need `split_at` in order to handle zip.
(That said, I did want to consider changing the structure so that the producer can offer a suggestion about where to split instead. But I haven't decided on the best way to do it.)
So the key distinctions are:

- can split at an arbitrary point
  - needed for `zip`
- knows exactly how many elements it will produce
  - needed for precise consumers
Today we tie those together, but they could be separated. Something like:

- `UnindexedProducer`
  - can split itself at some point
- `KnownLengthProducer: UnindexedProducer`
  - can tell you the number of items it will produce
- `Producer: KnownLengthProducer`
  - can further be split at an arbitrary point
We might then have two bridge methods:

- one takes a `KnownLengthProducer` and a `Consumer`
  - this is the one that today requires a `Producer`
  - it would instead invoke `producer.split()`, get the resulting length, and invoke `consumer.split_at()`, as you suggested
- one takes an `UnindexedProducer` and an `UnindexedConsumer`
  - same as today
We would probably have to tweak the `ParallelIterator` hierarchy a bit too, but the idea is that `zip()` would require that its sub-iterators can produce a full `Producer` (or, potentially, that the LHS can produce a `KnownLengthProducer` and the RHS can produce a full `Producer`). Its `split()` implementation can then call `left.split()` and `right.split_at(left_split.len())`.
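A rough Rust sketch of that hypothetical hierarchy (the trait names are only the ones proposed above, not rayon's actual API):

```rust
// Hypothetical hierarchy as proposed above; not rayon's actual traits.
trait UnindexedProducer: Sized {
    type Item;
    /// Split itself at some point of its own choosing.
    fn split(self) -> (Self, Self);
}

/// Additionally knows exactly how many items it will produce.
trait KnownLengthProducer: UnindexedProducer {
    fn len(&self) -> usize;
}

/// Additionally can be split at an arbitrary point.
trait Producer: KnownLengthProducer {
    fn split_at(self, index: usize) -> (Self, Self);
}
```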
Make sense?
Yes, that makes sense. I'm happy with the two traits as they are now; the `UnindexedProducer` here didn't want to participate in `zip` in any way, even though it can tell its exact length, because it doesn't produce elements in any particular order.
> At some point, we discussed whether we should have `Producer: UnindexedProducer`, but we didn't have a good reason for it (even though it logically makes sense). The existence of this duplicate code might provide such a reason.
Yes, although IIRC we were also wanting `impl<T: Producer> UnindexedProducer for T` to avoid duplicating `can_split`/`split` boilerplate, but that would really want specialization. Plus that boilerplate would never actually be invoked.
In the case of just `fold_with`, perhaps we could make a `BaseProducer` common to both, so the boilerplate is a one-line empty `impl` in most cases. Or we could extract this body to a shared free function just to avoid the duplication. But this is only 6 lines, let's not overthink it. :)
I think I'd vote for not adding more trivial traits to implement, unless they do something noticeable.
Yet another option is to add something like:

```rust
trait Folder {
    // ...
    fn consume_iter<I: IntoIterator>(self, iterator: I) -> Self {
        // ...
    }
}
```
... which might be useful to folks specializing `fold_with` too.
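For instance (using illustrative stand-in types only, not rayon's actual `Folder`), a producer whose rows are contiguous could then implement its fold by handing whole rows to the folder:

```rust
// Minimal stand-in for the Folder sketch above, just to show the idea.
trait FolderLike<T>: Sized {
    fn consume_iter<I: IntoIterator<Item = T>>(self, iter: I) -> Self;
}

// The tight inner loop lives inside the folder's consume_iter instead of
// calling consume() once per element.
fn fold_rows<F: FolderLike<f64>>(rows: Vec<Vec<f64>>, mut folder: F) -> F {
    for row in rows {
        folder = folder.consume_iter(row);
    }
    folder
}
```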
This method allows the producers to customize the inner loop. This is useful for iterators that can provide a more efficient inner loop than the default for loop. Example: multidimensional arrays.
Squashed commits together.
OK, I propose we merge this as is, but experiment with the producer hierarchy a bit.
Will this be in a point release, like 0.6.1?
@bluss presumably
Rayon is making big strides anyway, and that's exciting.
@bluss I'll push a new release ASAP, no reason to wait I guess